In [1]:
import datashader as ds
import datashader.transfer_functions as tf
import datashader.glyphs
from datashader import reductions
from datashader.core import bypixel
from datashader.utils import lnglat_to_meters as webm, export_image
from datashader.colors import colormap_select, Greys9, viridis, inferno
import copy

from pyproj import Proj, transform
import numpy as np
import pandas as pd
import urllib
import json
import datetime
import colorlover as cl

import plotly.offline as py
import plotly.graph_objs as go
from plotly import tools

from shapely.geometry import Point, Polygon, shape
# In order to get shapely, you'll need to run [pip install shapely] from your terminal

from functools import partial

from IPython.display import GeoJSON

py.init_notebook_mode()

For module 2 we'll be looking at techniques for dealing with big data, in particular binning strategies and the datashader library (which arguably proves we'll never need to bin large data for visualization again).

To demonstrate these concepts we'll be looking at the PLUTO dataset put out by New York City's department of city planning. PLUTO contains data about every tax lot in New York City.

PLUTO data can be downloaded from here. Unzip the files to the same directory as this notebook, and you should be able to read them in using this (or very similar) code. Also take note of the data dictionary; it'll come in handy for this assignment.

In [2]:
# Code to read in v17, column names have been updated (without upper case letters) for v18

# bk = pd.read_csv('PLUTO17v1.1/BK2017V11.csv')
# bx = pd.read_csv('PLUTO17v1.1/BX2017V11.csv')
# mn = pd.read_csv('PLUTO17v1.1/MN2017V11.csv')
# qn = pd.read_csv('PLUTO17v1.1/QN2017V11.csv')
# si = pd.read_csv('PLUTO17v1.1/SI2017V11.csv')

# ny = pd.concat([bk, bx, mn, qn, si], ignore_index=True)

ny = pd.read_csv('pluto_20v1.csv', low_memory=False)  # low_memory=False avoids a mixed-dtype DtypeWarning on these large files


# Getting rid of some outliers
ny = ny[(ny['yearbuilt'] > 1850) & (ny['yearbuilt'] < 2020) & (ny['numfloors'] != 0)]

I'll also do some prep for the geographic component of this data, which we'll be relying on for datashader.

You're not required to know how I'm retrieving the latitude and longitude here, but for those interested: this dataset uses a flat x-y projection (treating the world as flat over a small enough area, for easier calculations), and this needs to be projected back to traditional latitude and longitude.

In [3]:
# wgs84 = Proj("+proj=longlat +ellps=GRS80 +datum=NAD83 +no_defs")
# nyli = Proj("+proj=lcc +lat_1=40.66666666666666 +lat_2=41.03333333333333 +lat_0=40.16666666666666 +lon_0=-74 +x_0=300000 +y_0=0 +ellps=GRS80 +datum=NAD83 +to_meter=0.3048006096012192 +no_defs")
# ny['xcoord'] = 0.3048*ny['xcoord']
# ny['ycoord'] = 0.3048*ny['ycoord']
# ny['lon'], ny['lat'] = transform(nyli, wgs84, ny['xcoord'].values, ny['ycoord'].values)

# ny = ny[(ny['lon'] < -60) & (ny['lon'] > -100) & (ny['lat'] < 60) & (ny['lat'] > 20)]

#Defining some helper functions for DataShader
background = "black"
export = partial(export_image, background = background, export_path="export")
cm = partial(colormap_select, reverse=(background!="black"))

Part 1: Binning and Aggregation

Binning is a common strategy for visualizing large datasets. Binning is inherent to a few types of visualizations, such as histograms and 2D histograms (also check out their close relatives, 2D density plots, and the more general form: heatmaps).
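
The mechanics behind histograms and 2D histograms can be sketched with NumPy's binning functions (illustrative random data here, not the PLUTO set):

```python
import numpy as np

# Illustrative data: 10,000 random points
rng = np.random.default_rng(0)
x = rng.normal(size=10_000)
y = rng.normal(size=10_000)

# 1D binning: the counts a histogram draws as bar heights
counts, edges = np.histogram(x, bins=20)

# 2D binning: the grid of counts behind a 2D histogram / heatmap
grid, xedges, yedges = np.histogram2d(x, y, bins=20)
```

Every point lands in exactly one bin, so the bin counts always sum back to the number of points.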

While these visualization types explicitly include binning, any type of visualization used with aggregated data can be looked at in the same way. For example, let's say we wanted to look at building construction over time. This would be best viewed as a line graph, but we can still think of our results as being binned by year:

In [4]:
trace = go.Scatter(
    # I'm choosing BBL here because I know it's a unique key.
    x = ny.groupby('yearbuilt').count()['bbl'].index,
    y = ny.groupby('yearbuilt').count()['bbl']
)

layout = go.Layout(
    xaxis = dict(title = 'Year Built'),
    yaxis = dict(title = 'Number of Lots Built')
)

fig = go.FigureWidget(data = [trace], layout = layout)

fig.show()

Something looks off... You're going to have to deal with this imperfect data to answer this first question.

But first: some notes on pandas. Pandas dataframes are a different beast than R dataframes; here are some tips to help you get up to speed:



Indexing and Selecting: .loc and .iloc are the analogs for base R subsetting, or filter() in dplyr
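
A minimal sketch of the difference (toy data):

```python
import pandas as pd

df = pd.DataFrame({'yearbuilt': [1910, 1925, 1987],
                   'numfloors': [3, 2, 40]},
                  index=['a', 'b', 'c'])

# .loc: label-based selection, including boolean masks (like dplyr::filter)
tall = df.loc[df['numfloors'] > 10]

# .iloc: integer-position-based selection
first_row = df.iloc[0]
```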

Group By: This is the pandas analog to group_by(), and the appended function is the analog to summarize(). Try out a few examples of this, and display the results in Jupyter. Take note of what's happening to the indexes: you'll notice that they become hierarchical. Once you perform an aggregation, try running the resulting hierarchical dataframe through a reset_index().

Reset_index: I personally find the hierarchical indexes more of a burden than a help, and this sort of hierarchical indexing leads to a fundamentally different experience compared to R dataframes. reset_index() is a way of restoring a dataframe to a flatter index style. Grouping is where you'll notice it the most, but it's also useful when you filter data, and in a few other split-apply-combine workflows. With pandas, indexes are more meaningful, so use this if you start getting unexpected results.
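
To make the hierarchical-index behavior concrete, here's a toy group-by followed by reset_index():

```python
import pandas as pd

df = pd.DataFrame({'decade':  [1910, 1910, 1920],
                   'borough': ['BK', 'QN', 'BK'],
                   'bbl':     [1, 2, 3]})

# The group_by() + summarize() analog: counts per (decade, borough)
counts = df.groupby(['decade', 'borough']).count()
# counts now has a hierarchical (MultiIndex) index...

# ...which reset_index() flattens back into ordinary columns
flat = counts.reset_index()
```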

Indexes are more important in pandas than in R. If you delve deeper into using Python for data science, you'll begin to see the benefits in many places (despite the personal gripes I highlighted above). One place these indexes come in handy is with time series data. The pandas docs have a huge section on datetime indexing. In particular, check out resample, which provides time-series-specific aggregation.
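
A small sketch of resample() on a synthetic daily series:

```python
import pandas as pd

# 60 days of daily observations with a DatetimeIndex
idx = pd.date_range('2020-01-01', periods=60, freq='D')
s = pd.Series(1, index=idx)

# resample() aggregates by time period -- here, counts per calendar week
weekly = s.resample('W').sum()
```

Each weekly bucket holds the sum of the days that fall in it, so the totals are preserved.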

Merging, joining, and concatenation: There's some overlap between these different types of merges, so use this as your guide. concat is a single function that replaces cbind and rbind in R, and the results are driven by the indexes. Read through these examples to get a feel for how these are performed, but you will have to manage your indexes when using these functions. Merges are fairly similar to merges in R, similarly mapping to SQL joins.
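
A toy version of concat-as-rbind plus a merge:

```python
import pandas as pd

bk = pd.DataFrame({'bbl': [1, 2], 'borough': 'BK'})
qn = pd.DataFrame({'bbl': [3],    'borough': 'QN'})

# concat along axis=0 plays the role of rbind (axis=1 would be cbind);
# ignore_index=True rebuilds a clean 0..n-1 index
stacked = pd.concat([bk, qn], ignore_index=True)

# merge maps to SQL joins, much like merge() in R
lookup = pd.DataFrame({'borough': ['BK', 'QN'],
                       'name':    ['Brooklyn', 'Queens']})
joined = stacked.merge(lookup, on='borough', how='left')
```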

Apply: This is explained in the "group by" section linked above. These are your analogs to the plyr library in R. Take note of the lambda syntax used here; these are anonymous functions in Python. Rather than predefining a custom function, you can just define it inline using lambda.
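
For example, an inline lambda applied per group (toy data):

```python
import pandas as pd

df = pd.DataFrame({'decade':    [1910, 1910, 1920],
                   'numfloors': [3.0, 6.0, 40.0]})

# An anonymous (lambda) function computing the floor-count range per decade,
# with no need to predefine a named function
floor_range = df.groupby('decade')['numfloors'].apply(lambda g: g.max() - g.min())
```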

Browse through the other sections for some other specifics, in particular reshaping and categorical data (pandas' answer to factors). Pandas can take a while to get used to, but it's a pretty strong framework that makes more advanced operations easier. Rolling functions, for example, follow logically from the apply workflow (and led to the best Google results ever when I first tried to find this out and googled "pandas rolling").
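
A quick taste of a rolling function, which indeed reads like a windowed apply:

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5])

# 3-observation rolling mean; the first two entries are NaN
# because the window isn't full yet
smoothed = s.rolling(window=3).mean()
```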

Google Wes McKinney's book "Python for Data Analysis," which is a cookbook-style intro to pandas. It's an O'Reilly book that should be pretty widely available.


Question

After a few building collapses, the City of New York is going to begin investigating older buildings for safety. The city is particularly worried about buildings that were unusually tall when they were built, since best practices for safety hadn't yet been determined. Create a graph that shows how many buildings of a certain number of floors were built in each year (note: you may want to use a log scale for the number of buildings). Find a strategy to bin buildings (it should be clear that 20-29-story, 30-39-story, and 40-49-story buildings were first built in large numbers, but does it make sense to continue in this way as you get taller?)

I am going to take a look at the data first.

In [5]:
# take a quick look at the data
ny.head(n=10)
Out[5]:
borough block lot cd ct2010 cb2010 schooldist council zipcode firecomp ... dcasdate zoningdate landmkdate basempdate masdate polidate edesigdate geom dcpedited notes
0 BK 834 46 307.0 106.0 2001.0 20.0 38.0 11220.0 L114 ... NaN NaN NaN NaN NaN NaN NaN 0106000020E61000000100000001030000000100000005... NaN NaN
1 QN 4042 106 407.0 929.0 3000.0 25.0 19.0 11356.0 E297 ... NaN NaN NaN NaN NaN NaN NaN 0106000020E61000000100000001030000000100000007... NaN NaN
2 BK 4679 17 317.0 866.0 3002.0 18.0 41.0 11203.0 L174 ... NaN NaN NaN NaN NaN NaN NaN 0106000020E61000000100000001030000000100000006... NaN NaN
3 BK 7831 6 318.0 676.0 1002.0 22.0 46.0 11234.0 L159 ... NaN NaN NaN NaN NaN NaN NaN 0106000020E61000000100000001030000000100000005... NaN NaN
4 BK 7831 7 318.0 676.0 1002.0 22.0 46.0 11234.0 L159 ... NaN NaN NaN NaN NaN NaN NaN 0106000020E61000000100000001030000000100000005... NaN NaN
5 BK 7831 8 318.0 676.0 1002.0 22.0 46.0 11234.0 L159 ... NaN NaN NaN NaN NaN NaN NaN 0106000020E61000000100000001030000000100000005... NaN NaN
6 BK 7831 9 318.0 676.0 1002.0 22.0 46.0 11234.0 L159 ... NaN NaN NaN NaN NaN NaN NaN 0106000020E61000000100000001030000000100000005... NaN NaN
7 BK 7831 13 318.0 676.0 1002.0 22.0 46.0 11234.0 L159 ... NaN NaN NaN NaN NaN NaN NaN 0106000020E61000000100000001030000000100000005... NaN NaN
8 BK 8797 50 315.0 622.0 2005.0 22.0 48.0 11235.0 L169 ... NaN NaN NaN NaN NaN NaN NaN 0106000020E61000000100000001030000000100000006... NaN NaN
9 BK 7832 18 318.0 676.0 1003.0 22.0 46.0 11234.0 E323 ... NaN NaN NaN NaN NaN NaN NaN 0106000020E61000000100000001030000000100000006... NaN NaN

10 rows × 99 columns

I only need a few columns, so I will make a new dataframe with just those columns. The notes in the data dictionary mention that in early years, the actual year the building was built was not known and consequently the year built was estimated at either the mid-decade or end of the decade. This is why the graph above looked strange with all the peaks. To account for this, I am going to make a new column for the decade the building was built. In addition, the variable for number of floors is sometimes a fraction. I am going to round all of these fractional floor numbers up to the next floor.

In [6]:
# subset to only the columns needed
buildings = ny[['yearbuilt', 'numfloors', 'bbl']].copy()  # .copy() so the new columns below don't trigger SettingWithCopyWarning
# create a new column with the decade of year built
buildings['decade'] = (buildings.yearbuilt/10).astype(int)*10
# round up number of floors when fractional
buildings['floors_new'] = np.ceil(buildings['numfloors'])
buildings.head(n = 10)

Out[6]:
yearbuilt numfloors bbl decade floors_new
0 1931.0 3.00 3008340046 1930 3.0
1 1910.0 2.75 4040420106 1910 3.0
2 1920.0 2.00 3046790017 1920 2.0
3 1920.0 2.00 3078310006 1920 2.0
4 1920.0 2.00 3078310007 1920 2.0
5 1920.0 2.00 3078310008 1920 2.0
6 1920.0 2.00 3078310009 1920 2.0
7 1920.0 2.00 3078310013 1920 2.0
8 1920.0 1.00 3087970050 1920 1.0
9 1955.0 2.00 3078320018 1950 2.0

Now I want to look at some summary data for the number of floors and year built.

In [7]:
# look at some summary data for the number of floors
buildings['numfloors'].describe()
Out[7]:
count    811684.000000
mean          2.457334
std           1.982234
min           0.100000
25%           2.000000
50%           2.000000
75%           2.500000
max         205.000000
Name: numfloors, dtype: float64
In [8]:
buildings['yearbuilt'].describe()
Out[8]:
count    811684.000000
mean       1941.003352
std          30.440113
min        1851.000000
25%        1920.000000
50%        1931.000000
75%        1960.000000
max        2019.000000
Name: yearbuilt, dtype: float64
In [9]:
# summarise the number of floors by decade
decade_numfloors = buildings['numfloors'].groupby(buildings['decade'])
decade_numfloors.describe()
Out[9]:
count mean std min 25% 50% 75% max
decade
1850 1555.0 3.537460 0.919151 1.00 3.0 3.00 4.0 11.0
1860 1548.0 3.615956 1.262894 1.00 3.0 3.00 4.0 27.0
1870 2736.0 3.377741 1.082093 1.00 3.0 3.00 4.0 15.0
1880 5111.0 3.473782 1.334794 1.00 3.0 3.00 4.0 46.0
1890 26083.0 2.749249 1.077781 1.00 2.0 2.75 3.0 29.0
1900 41648.0 2.963619 1.590282 1.00 2.0 2.50 3.0 51.0
1910 62776.0 2.798702 1.619104 1.00 2.0 2.00 3.0 60.5
1920 177351.0 2.483459 1.549883 0.50 2.0 2.00 2.5 205.0
1930 137185.0 2.319549 1.290031 0.10 2.0 2.00 2.5 102.0
1940 66269.0 1.927287 0.943929 0.21 1.5 2.00 2.0 90.0
1950 79673.0 1.865907 1.381795 1.00 1.0 2.00 2.0 57.0
1960 62691.0 2.276902 2.364505 1.00 2.0 2.00 2.0 60.0
1970 32577.0 2.336774 2.694747 0.10 2.0 2.00 2.0 60.0
1980 27261.0 2.624441 3.303439 1.00 2.0 2.00 3.0 78.0
1990 28962.0 2.462056 1.950863 1.00 2.0 2.00 3.0 62.0
2000 43115.0 2.899843 2.938829 0.30 2.0 2.00 3.0 114.0
2010 15143.0 4.218381 5.751172 0.10 2.0 3.00 4.0 90.0

The number of floors ranges from less than 1 to 205. The median is 2 floors, so most buildings must have very few floors. Years built range from 1851 to 2019. Looking at the number of floors by decade shows that the median for every decade is 2 to 3 floors. The 1920s and 1930s have the most records, so most of the buildings were built in those decades. Here is a graph showing the number of buildings by decade.

In [10]:
# Buildings Count vs Decade
trace = go.Bar(
    x = buildings.groupby('decade').count()['bbl'].index,
    y = buildings.groupby('decade').count()['bbl']
)

layout = go.Layout(
    xaxis = dict(title = 'Decade'),
    yaxis = dict(title = 'Number of Buildings')
)

fig2 = go.FigureWidget(data = [trace], layout = layout)

fig2.show()

The scatter plot below shows the number of buildings by the number of floors. I used the log scale for the number of buildings because based on the previous summary data, we can see that the data is right skewed (i.e. there are many buildings with a small number of floors and very few buildings with a large number of floors). The scatter plot definitely shows this skew.

In [11]:
# Buildings Count vs Number of floors
trace = go.Scatter(
    x = buildings.groupby('floors_new').count()['bbl'].index,
    y = buildings.groupby('floors_new').count()['bbl']
    ,
   mode = 'markers'
)

layout = go.Layout(
    xaxis = dict(title = 'Number of Floors'),
    yaxis = dict(title = 'Number of Buildings (log scale)', type = 'log', dtick = 1)
)

fig3 = go.FigureWidget(data = [trace], layout = layout)

fig3.show()

To get a sense of the distribution of building heights by decade I will look at some box plots. I decided to limit the range of the number of floors in the graph in order to see the distributions better. Without limiting the floors, all of the boxplots would have been very tiny. In all decades, most buildings are less than 4 floors.

In [12]:
trace = go.Box(
    x = buildings['decade'],
    y = buildings['numfloors'],
    marker_size = 1
)

layout = go.Layout(
    xaxis = dict(title = 'Decade'),
    yaxis = dict(title = 'Number of Floors', range = (0,10)),
    title = dict(text = 'Box Plots of Number of Floors by Decade (displaying only 0-10 floors)')
)

fig4 = go.FigureWidget(data = [trace], layout = layout)

fig4.show()

Based on all the summary data above, I will make bins for number of floors as follows: <= 5, 6-10, 11-15, 16-20, 21-30, 31-40, 41-60, 61-80, 81-100, 101+

These bin sizes account for the many buildings with lower numbers of floors and the fewer buildings as the number of floors increases.

In [13]:
def bins(row):
    if row['numfloors'] <= 5:
        return '0-5'
    elif row['numfloors'] <= 10:
        return '6-10'
    elif row['numfloors'] <= 15:
        return '11-15'
    elif row['numfloors'] <= 20:
        return '16-20'
    elif row['numfloors'] <= 30:
        return '21-30'
    elif row['numfloors'] <= 40:
        return '31-40'
    elif row['numfloors'] <= 60:
        return '41-60'
    elif row['numfloors'] <= 80:
        return '61-80'
    elif row['numfloors'] <= 100:
        return '81-100'
    return '101+'
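
As an aside, the same binning can be done without a row-wise apply by using pd.cut, which is much faster on ~800k rows. A sketch with toy values (the edges mirror the thresholds in the function above; pd.cut's intervals are right-closed, matching the <= comparisons):

```python
import pandas as pd

edges = [0, 5, 10, 15, 20, 30, 40, 60, 80, 100, float('inf')]
labels = ['0-5', '6-10', '11-15', '16-20', '21-30',
          '31-40', '41-60', '61-80', '81-100', '101+']

numfloors = pd.Series([3.0, 12.0, 205.0])
floor_bins = pd.cut(numfloors, bins=edges, labels=labels)
```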
In [14]:
buildings['floor_bins'] = buildings.apply(bins, axis=1)

In [15]:
buildings.head(n=10)
Out[15]:
yearbuilt numfloors bbl decade floors_new floor_bins
0 1931.0 3.00 3008340046 1930 3.0 0-5
1 1910.0 2.75 4040420106 1910 3.0 0-5
2 1920.0 2.00 3046790017 1920 2.0 0-5
3 1920.0 2.00 3078310006 1920 2.0 0-5
4 1920.0 2.00 3078310007 1920 2.0 0-5
5 1920.0 2.00 3078310008 1920 2.0 0-5
6 1920.0 2.00 3078310009 1920 2.0 0-5
7 1920.0 2.00 3078310013 1920 2.0 0-5
8 1920.0 1.00 3087970050 1920 1.0 0-5
9 1955.0 2.00 3078320018 1950 2.0 0-5

Now I will summarize the counts of buildings by decade and floor bin.

In [16]:
dec_bins = buildings.groupby(['decade', 'floor_bins']).count()['bbl']
dec_bins
Out[16]:
decade  floor_bins
1850    0-5           1506
        11-15            1
        6-10            48
1860    0-5           1485
        11-15            3
                      ... 
2010    31-40           73
        41-60           59
        6-10          1768
        61-80           27
        81-100           5
Name: bbl, Length: 115, dtype: int64

Once again I will apply log to the count of the buildings since there is skew.

In [17]:
dec_bins = dec_bins.apply(np.log)
bins_df = dec_bins.to_frame().reset_index()
bins_df.head()
Out[17]:
decade floor_bins bbl
0 1850 0-5 7.317212
1 1850 11-15 0.000000
2 1850 6-10 3.871201
3 1860 0-5 7.303170
4 1860 11-15 1.098612

Now I will make a heatmap to show the count of buildings (log scale) by decade and the floor bins. This will provide a visual to help the building inspector determine buildings that are unusually tall for when they were built.

In [18]:
order = ['0-5', '6-10', '11-15', '16-20', '21-30', '31-40', '41-60', '61-80', '81-100', '101+']
fig_heat = go.Figure(data=go.Heatmap(
                    z=bins_df['bbl'],
                    x=bins_df['decade'],
                    y=bins_df['floor_bins'],
                    hoverongaps = False,
                    colorscale = 'Jet'),
                layout=go.Layout(
                    plot_bgcolor='white',
                    title='Heatmap Buildings (log scale) by Height and Decade',
                    yaxis={'title':'Number of Floors', 'categoryorder':'array', 'categoryarray': order},  # categoryarray only takes effect with categoryorder='array'
                    xaxis={'title': ''}))
fig_heat.show()

Based on the heatmap, we see that prior to 1890, buildings with more than 10 floors were highly unusual. From 1900-1910, buildings with more than 30 floors were rare. In the 1920s and 1930s, 60 floors and above were uncommon. From 1980-2009, buildings with more than 60 floors were again rare. And finally, in the 2010s, buildings with more than 80 floors were uncommon. Using this heatmap, we can clearly see how building heights have changed over time.

Part 2: Datashader

Datashader is a library from Anaconda that does away with the need for binning data. It takes in all of your data points and, based on the canvas and range, performs pixel-by-pixel calculations to come up with the best representation of the data. In short, it completely eliminates the need to bin your data before visualizing it.

As an example, let's continue with our question above and look at a 2D histogram of YearBuilt vs NumFloors:

In [19]:
yearbins = 200
floorbins = 200

yearBuiltCut = pd.cut(ny['yearbuilt'], np.linspace(ny['yearbuilt'].min(), ny['yearbuilt'].max(), yearbins))
# np.logspace is base 10, so the exponent must be log10 (np.log is the natural log)
numFloorsCut = pd.cut(ny['numfloors'], np.logspace(1, np.log10(ny['numfloors'].max()), floorbins))

xlabels = np.floor(np.linspace(ny['yearbuilt'].min(), ny['yearbuilt'].max(), yearbins))
ylabels = np.floor(np.logspace(1, np.log10(ny['numfloors'].max()), floorbins))

fig = go.FigureWidget(
    data = [
        go.Heatmap(z = ny.groupby([numFloorsCut, yearBuiltCut])['bbl'].count().unstack().fillna(0).values,
              colorscale = 'Greens', x = xlabels, y = ylabels)
    ]
)

fig.show()

This shows us the distribution, but it's subject to some biases discussed in the Anaconda notebook Plotting Perils.

Here is what the same plot would look like in datashader:

In [20]:
cvs = ds.Canvas(800, 500, x_range = (ny['yearbuilt'].min(), ny['yearbuilt'].max()), 
                                y_range = (ny['numfloors'].min(), ny['numfloors'].max()))
agg = cvs.points(ny, 'yearbuilt', 'numfloors')
view = tf.shade(agg, cmap = cm(Greys9), how='log')
export(tf.spread(view, px=2), 'yearvsnumfloors')
Out[20]:

That's technically just a scatterplot, but the points are smartly placed and colored to mimic what one gets in a heatmap. Depending on the pixel size, it will either display individual points or color denser regions more intensely.

Datashader really shines when looking at geographic information. Here are the latitudes and longitudes of our dataset plotted out, giving us a map of the city colored by density of structures:

In [21]:
NewYorkCity   = (( 913164.0,  1067279.0), (120966.0, 272275.0))
cvs = ds.Canvas(700, 700, *NewYorkCity)
agg = cvs.points(ny, 'xcoord', 'ycoord')
view = tf.shade(agg, cmap = cm(inferno), how='log')
export(tf.spread(view, px=2), 'firery')
Out[21]:

Interestingly, since we're looking at structures, the large buildings of Manhattan show up as less dense on the map. The densest areas measured by number of lots would be single or multi family townhomes.

Unfortunately, Datashader doesn't have the best documentation. Browse through the examples from their github repo. I would focus on the visualization pipeline and the US Census Example for the question below. Feel free to use my samples as templates as well when you work on this problem.

Question

You work for a real estate developer and are researching underbuilt areas of the city. After looking in the PLUTO data dictionary, you've discovered that all tax assessments consist of two parts: the assessment of the land and the assessment of the structure. You reason that there should be a correlation between these two values: more valuable land will have more valuable structures on it (more valuable in this case refers not just to a mansion vs a bungalow, but an apartment tower vs a single family home). Deviations from the norm could represent underbuilt or overbuilt areas of the city. You also recently read a really cool blog post about bivariate choropleth maps, and think the technique could be used for this problem.

Datashader is really cool, but it's not that great at labeling your visualization. Don't worry about providing a legend, but provide a quick explanation as to which areas of the city are overbuilt, which areas are underbuilt, and which areas are built in a way that's properly correlated with their land value.

First, I create a subset with just the columns I need. I also need the assessed value of the structure. The variable 'structure' is calculated as the total assessed value minus the assessed value of the land. Summary data for the assessed land and assessed structure values are displayed.

In [22]:
# subset of only the needed columns
bi_map = ny[['assesstot', 'assessland', 'xcoord', 'ycoord']].copy()
# create variable for the assessment of the structure
bi_map['structure'] = bi_map['assesstot'] - bi_map['assessland']

# look at some summary data
bi_map['assessland'].describe()
Out[22]:
count    8.116840e+05
mean     1.081384e+05
std      3.988562e+06
min      0.000000e+00
25%      1.038000e+04
50%      1.416000e+04
75%      2.178000e+04
max      3.211276e+09
Name: assessland, dtype: float64
In [23]:
# look at some summary data
bi_map['structure'].describe()
Out[23]:
count    8.116840e+05
mean     4.126351e+05
std      7.106368e+06
min      0.000000e+00
25%      2.382000e+04
50%      3.660000e+04
75%      6.894000e+04
max      3.924465e+09
Name: structure, dtype: float64

I see there are records where the assessed land is 0 and also some where the assessed structure is 0. I remove these records, since they provide no value.

In [24]:
# remove records with assessed land = 0
bi_map = bi_map[(bi_map['assessland']!=0)]
bi_map['assessland'].describe()
Out[24]:
count    8.112830e+05
mean     1.081919e+05
std      3.989547e+06
min      6.000000e+01
25%      1.038000e+04
50%      1.416000e+04
75%      2.178000e+04
max      3.211276e+09
Name: assessland, dtype: float64
In [25]:
# remove records with assessed structure = 0
bi_map = bi_map[(bi_map['structure']!=0)]
bi_map['structure'].describe()
Out[25]:
count    8.091870e+05
mean     4.139060e+05
std      7.117287e+06
min      4.500000e+01
25%      2.388000e+04
50%      3.672000e+04
75%      6.918000e+04
max      3.924465e+09
Name: structure, dtype: float64

Following the strategy from https://www.joshuastevens.net/cartography/make-a-bivariate-choropleth-map/, I create 3 equal bins each for the assessed land value and the assessed structure value, then categorize each building into one of the 9 possible combination bins.

In [26]:
# Create 3 bins for the assessed land and assessed structure
bi_map['struct_bin']=pd.qcut(bi_map['structure'],q=3,labels=['A','B','C'])
bi_map['land_bin']=pd.qcut(bi_map['assessland'],q=3,labels=['1','2','3'])

# Categorize each building into the bins from above, 9 total combinations
bi_map['comb_bin'] = bi_map['struct_bin'].astype(str) + bi_map['land_bin'].astype(str)
bi_map['comb_bin'] = pd.Categorical(bi_map['comb_bin'])
bi_map.head()
Out[26]:
assesstot assessland xcoord ycoord structure struct_bin land_bin comb_bin
0 350550.0 146250.0 982211.0 171707.0 204300.0 C 3 C3
1 78900.0 12240.0 1026895.0 225880.0 66660.0 C 2 C2
2 34380.0 18120.0 1004527.0 177269.0 16260.0 A 3 A3
3 24600.0 7680.0 1004804.0 166580.0 16920.0 A 1 A1
4 29760.0 8160.0 1004784.0 166579.0 21600.0 A 1 A1

I use one of the color palettes from https://www.joshuastevens.net/cartography/make-a-bivariate-choropleth-map/ to map the combination bins to colors for use in the map.

In [27]:
color_map = {'A3': '#9972af', 'B3': '#976b82', 'C3': '#804d36', 
             'A2': '#cbb8d7', 'B2': '#c8ada0', 'C2': '#af8e53', 
             'A1': '#e8e8e8', 'B1': '#e4d9ac', 'C1': '#c8b35a'}

Now, I will use the code above, but apply the combination bin data and color map.

In [28]:
NY_Value   = (( 913164.0,  1067279.0), (120966.0, 272275.0))
cvs = ds.Canvas(700, 700, *NY_Value)
agg = cvs.points(bi_map, 'xcoord', 'ycoord',ds.count_cat('comb_bin'))
view = tf.shade(agg, color_key=color_map)
export(tf.spread(view, px=2), 'Question2')
Out[28]:

The dark brown is high assessed land and high assessed structure; that pattern appears throughout Manhattan, parts of Brooklyn, and most coastal areas. We are interested in areas that are underbuilt or overbuilt. The purples mark underbuilt areas: eastern parts of Brooklyn and southern parts of Queens, as well as parts of Staten Island. The yellow-brown marks overbuilt areas, which show up in western parts of Brooklyn and northern parts of Queens.